Gist

The site https://www.mathgenealogy.org/ contains over 276,000 records of mathematics PhD graduates and their supervisors. This is effectively a genealogy of mathematical supervision (which should have a sizable effect on thinking, topics, and reading). The R package ggenealogy contains an example dataset from this source and facilitates consuming and plotting this type of data.

Given that my thesis was just certified, I want to see whether I can trace up the mathematical genealogy tree to visualize my thought-leading predecessors.

Setup

library(ggenealogy)
library(ggplot2)
library(magrittr)

data("statGeneal", package = "ggenealogy")
df <- statGeneal %>%
  #dplyr::filter(parent != "") %>%
  tibble::as_tibble()
print(df, n=3)
## # A tibble: 8,165 x 6
##   child            parent             gradYear country     
##   <chr>            <chr>                 <dbl> <chr>       
## 1 Nicolas Chopin   "Christian Robert"     2003 France      
## 2 Melvin Springer  "Everett Welker"       1947 UnitedStates
## 3 Shelemyahu Zacks ""                     1962 UnitedStates
##   school                                     
##   <chr>                                      
## 1 Université Pierre-et-Marie-Curie - Paris VI
## 2 University of Illinois at Urbana-Champaign 
## 3 Columbia University                        
##   thesis                                                                        
##   <chr>                                                                         
## 1 Applications of Sequential Monte Carlo methods to Bayesian Statistics         
## 2 Joint Sampling Distribution of Mean and Standard Deviation for a Chi-square U~
## 3 Optimal Strategies in Randomized Factorial Experiments                        
## # ... with 8,162 more rows
hist(df$gradYear)

OK, about 8k observations covering “all the parent-child relationships where both parent and child received an advanced degree of statistics as of June 6, 2015.” This may or may not contain the people I am looking for.

Note that grad year:

  • Is in the range [1864, 2015].
  • Median is 5 greater than mean (left skew)
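
The two notes above can be checked with a few lines of base R. This is a minimal sketch: `summarise_years()` is a hypothetical helper (not part of ggenealogy), and the toy years are made up so the block runs without the package.

```r
# Hypothetical helper: range, mean, median, and their difference for a
# numeric vector. A positive median_minus_mean suggests left skew.
summarise_years <- function(years) {
  years <- years[!is.na(years)]
  list(
    range             = range(years),
    mean              = mean(years),
    median            = stats::median(years),
    median_minus_mean = stats::median(years) - mean(years)
  )
}

# Made-up graduation years: a few early values pull the mean below the median.
toy_years <- c(1900, 1950, 1990, 2000, 2005, 2010, 2015)
s <- summarise_years(toy_years)
s$range
s$median_minus_mean
```

On the real data the call would be `summarise_years(df$gradYear)`.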

Where in the world is …?

Through trial and error I know that Di Cook is not in the data. The original paper does have Thomas Lumley, another professor of interest. But perhaps first I will manually look up Cook’s genealogy.

Di, Di’s supervisor, and “grand-supervisor” are not in the list, so I may have to go to plan B: looking at Thomas Lumley. After looking at both parents and children, I know that Thomas has 1 child in the data: Petra Buzkova. From the paper, we can see that the oldest predecessor is David Cox.

lumley_p <- grepl("Lumley", df$parent, fixed = TRUE)
sum(lumley_p)
## [1] 1
df[lumley_p, ]
## # A tibble: 1 x 6
##   child         parent        gradYear country      school                  
##   <chr>         <chr>            <dbl> <chr>        <chr>                   
## 1 Petra Buzkova Thomas Lumley     2004 UnitedStates University of Washington
##   thesis                                                                        
##   <chr>                                                                         
## 1 Marginal Regression Analysis of Longitudinal Data with Irregular, Biased Samp~
## Prep the network info; more on this in `As network layouts (iGraph)`.
ig <- dfToIG(df)
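
The lookup above only searches the parent column. A small sketch of searching both columns at once follows; `find_person()` is a hypothetical helper, and the toy names are invented so the block runs on its own. It assumes a data frame with `child` and `parent` character columns, as in `df`.

```r
# Hypothetical helper: rows where a name fragment appears as child or parent,
# with a `role` column saying which column matched (child wins on ties).
find_person <- function(geneal, pattern) {
  as_child  <- grepl(pattern, geneal$child,  fixed = TRUE)
  as_parent <- grepl(pattern, geneal$parent, fixed = TRUE)
  keep <- as_child | as_parent
  out <- geneal[keep, ]
  out$role <- ifelse(as_child[keep], "child", "parent")
  out
}

# Invented toy genealogy with the same column names as the real data.
toy <- data.frame(
  child  = c("Student A", "Student B", "Student C"),
  parent = c("Adviser X", "Student A", ""),
  stringsAsFactors = FALSE
)
hits <- find_person(toy, "Student A")
hits
```

On the real data, something like `find_person(df, "Lumley")` would list a person's own row (if any) together with their students.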

Finding a path

Let’s grab the paths while we are on the topic of names. Actually, if we go all the way to Buzkova, this is the example case in the paper.

pathCB <- getPath("David Cox", "Petra Buzkova", ig, df,
                  "gradYear", isDirected = FALSE)
plotPath(pathCB, df, "gradYear", fontFace = 4) +
  xlab("Graduation Year") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 10)) +
  scale_x_continuous(expand = c(0.1, 0.2))

Good, we have a start. We will want to find a way to traverse the hierarchy to find all of the ancestors without filling in the cousin nodes (or, preferably, filling them in only faintly). As an example poster, see https://www.mathgenealogy.org/posters/raich.pdf.
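
One way to approach the ancestors-only traversal is to walk child-to-parent links generation by generation. The sketch below uses base R only; `get_ancestor_edges()` is a hypothetical helper, the toy names are invented, and blank parents are treated as "unknown" and skipped (an assumption about how to read the data).

```r
# Hypothetical helper: collect child -> parent edges reachable upward from
# `name`, tagging each edge with the generation at which it was found.
get_ancestor_edges <- function(geneal, name, max_gen = 20) {
  found <- list()
  frontier <- name
  seen <- name
  for (g in seq_len(max_gen)) {
    keep <- geneal$child %in% frontier &
      !is.na(geneal$parent) & geneal$parent != ""
    if (!any(keep)) break
    found[[g]] <- data.frame(child = geneal$child[keep],
                             parent = geneal$parent[keep],
                             gen = g, stringsAsFactors = FALSE)
    # Only expand parents not visited before, which also guards against cycles.
    frontier <- setdiff(geneal$parent[keep], seen)
    seen <- c(seen, frontier)
    if (length(frontier) == 0) break
  }
  if (length(found) == 0) {
    return(data.frame(child = character(0), parent = character(0),
                      gen = integer(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, found)
}

# Invented toy lineage: D -> C -> B -> A, plus a cousin who also studied under B.
toy <- data.frame(
  child  = c("D", "C", "B", "Cousin", "A"),
  parent = c("C", "B", "A", "B", ""),
  stringsAsFactors = FALSE
)
anc <- get_ancestor_edges(toy, "D")
anc
```

The cousin never enters the result because the walk only follows rows whose child is already on the lineage. Rows outside `anc` could then be drawn faintly rather than dropped. The package's own `getAncestors()` may cover similar ground.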

Making trees

We can look at trees from a top-down or bottom-up view. Top-down works well, though bottom-up not so much, at least with this data and these functions. Of particular note, the latter case contains only 1:1 student:adviser relationships. Studying the example poster we see that

l <- plotAncDes("David Cox", df, mAnc = 1, mDes = 6, vCol = "blue") +
  labs(subtitle = "Interesting, but too many \n  cousins of Thomas Lumley")
r <- plotAncDes("Thomas Lumley", df, mAnc = 6, mDes = 1, vCol = "blue") +
  labs(subtitle = "Not very interesting, \n  nb only 1:1 relationships")

library(patchwork)
l + r

Look for a better tree

I looked at a few of the more recent children in the plotPathOnAll output and by chance saw Hilary Parker, who co-hosts the podcast Not So Standard Deviations, https://nssdeviations.com/, of which I am a huge fan. Let’s see if she has a better tree:

parker_p <- grepl("Parker", df$child, fixed = TRUE)
sum(parker_p)
## [1] 8
parkers <- df[parker_p, ] %>% dplyr::pull(child)

plotAncDes("Hilary Parker", df, mAnc = 1, mDes = 6, vCol = "blue") +
  labs(subtitle = "Hilary Parker")

Well, it turns out none of the (8) Parker students have good trees. In my opinion, the filter requiring rows to be labeled as statistics-focused is too restrictive. Another shortcoming is that I haven’t seen an example of a student having multiple advisers.
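
Whether any student has multiple advisers can be checked directly rather than by eye. The sketch below is one hedged way to do it: `multi_adviser()` is a hypothetical helper, the toy rows are invented, and it assumes that a repeated `child` name with different non-blank parents means one person with several advisers (two different people sharing a name would be a false positive).

```r
# Hypothetical helper: names that appear as child with more than one
# distinct, non-blank parent.
multi_adviser <- function(geneal) {
  ok <- !is.na(geneal$parent) & geneal$parent != ""
  known <- unique(data.frame(child = geneal$child[ok],
                             parent = geneal$parent[ok],
                             stringsAsFactors = FALSE))
  counts <- table(known$child)
  as.character(names(counts)[as.vector(counts) > 1])
}

# Invented toy rows: Student A is listed under two advisers.
toy <- data.frame(
  child  = c("Student A", "Student A", "Student B", "Student C"),
  parent = c("Adviser X", "Adviser Y", "Adviser X", ""),
  stringsAsFactors = FALSE
)
multi <- multi_adviser(toy)
multi
```

On the real data, `length(multi_adviser(df))` would give the number of such names.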

Path on all

We can also highlight a path against the backdrop of the rest of the data, with nodes placed at iterating y-axis heights. It looks neat, but the vertical placement seems a bit arbitrary.

plotPathOnAll(pathCB, df, ig, "gradYear",
              bin = 200, nodeSize = .5, pathNodeSize = 2.5,
              nodeCol = "grey60", edgeCol = "grey80",
              animate = TRUE) ## returns an interactive plotly plot, not an animation

As network layouts (iGraph)

For networks and graphs, the igraph package is a long-standing go-to. We can get such an object with dfToIG(), which opens the door to all sorts of layouts and other network-related functions.

ig <- dfToIG(df)
class(ig)
## [1] "igraph"
ig
## IGRAPH 62f7545 UNW- 7123 8165 -- 
## + attr: name (v/c), weight (e/n)
## + edges from 62f7545 (vertex names):
##  [1] Nicolas Chopin   --Christian Robert   Melvin Springer  --Everett Welker    
##  [3] Shelemyahu Zacks --                   James Sweeder    --                  
##  [5] Nino Kordzakhia  --                   Pavel Vanecek    --Zuzana Prášková   
##  [7] Shyamal De       --                   Thomas Willke    --                  
##  [9] Vasant Huzurbazar--                   Rita Engelhardt  --William Cumberland
## [11] Fred Andrews     --                   Arthur Albert    --                  
## [13] John Folks       --                   Arnold Goodman   --                  
## [15] William Pruitt   --                   Thomas Birkner   --                  
## + ... omitted several edges
getBasicStatistics(ig)
## $isConnected
## [1] TRUE
## 
## $numComponents
## [1] 1
## 
## $avePathLength
## [1] 2.801
## 
## $graphDiameter
## [1] 10
## 
## $numNodes
## [1] 7123
## 
## $numEdges
## [1] 8165
## 
## $logN
## [1] 8.871
plot(ig)
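
One thing worth checking in the output above: the graph has 8165 edges, the same as the number of rows in df, and several printed edges end in a blank name. If every blank parent resolves to a single empty-named vertex, that vertex would tie otherwise unrelated people together, which might explain the single component and the short average path length. The sketch below only counts blank parents; `blank_parent_summary()` is a hypothetical helper and the toy rows are invented.

```r
# Hypothetical helper: how many rows lack a recorded parent, and how many
# distinct non-blank names appear in either column.
blank_parent_summary <- function(geneal) {
  blank <- is.na(geneal$parent) | geneal$parent == ""
  people <- setdiff(unique(c(geneal$child, geneal$parent)), c("", NA))
  list(
    rows_total        = nrow(geneal),
    rows_blank_parent = sum(blank),
    share_blank       = mean(blank),
    named_people      = length(people)
  )
}

# Invented toy rows: A and C have no recorded parent.
toy <- data.frame(
  child  = c("A", "B", "C", "D"),
  parent = c("", "A", "", "C"),
  stringsAsFactors = FALSE
)
bp <- blank_parent_summary(toy)
bp
```

If the share of blank parents in df is large, dropping those rows before calling dfToIG() (as the commented-out filter in Setup would do) is one option; whether the summary statistics change as a result has not been checked here.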

Conclusion

There is definitely potential to reproduce such genealogy posters. Unfortunately, the data included in the package does not seem sufficient for our purposes.

Session info

## Packages used
pkgs <- c("ggenealogy", "ggplot2")
## Package & session info
devtools::session_info(pkgs)
## - Session info ---------------------------------------------------------------
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Windows 10 x64 (build 19044)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United States.1252
##  ctype    English_United States.1252
##  tz       Australia/Sydney
##  date     2022-06-10
##  pandoc   2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
## 
## - Packages -------------------------------------------------------------------
##  package      * version date (UTC) lib source
##  askpass        1.1     2019-01-13 [1] CRAN (R 4.1.2)
##  base64enc      0.1-3   2015-07-28 [1] CRAN (R 4.1.1)
##  cli            3.3.0   2022-04-25 [1] CRAN (R 4.1.3)
##  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.1.2)
##  cpp11          0.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  crayon         1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
##  crosstalk      1.2.0   2021-11-04 [1] CRAN (R 4.1.2)
##  curl           4.3.2   2021-06-23 [1] CRAN (R 4.1.2)
##  data.table     1.14.2  2021-09-27 [1] CRAN (R 4.1.2)
##  digest         0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
##  dplyr          1.0.9   2022-04-28 [1] CRAN (R 4.1.3)
##  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.0.5)
##  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
##  farver         2.1.0   2021-02-28 [1] CRAN (R 4.1.2)
##  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
##  generics       0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
##  ggenealogy   * 1.0.1   2020-03-04 [1] CRAN (R 4.1.3)
##  ggplot2      * 3.3.6   2022-05-03 [1] CRAN (R 4.1.3)
##  glue           1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
##  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.1.1)
##  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
##  htmlwidgets    1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httr           1.4.3   2022-05-04 [1] CRAN (R 4.1.3)
##  igraph         1.3.1   2022-04-20 [1] CRAN (R 4.1.3)
##  isoband        0.2.5   2021-07-13 [1] CRAN (R 4.1.2)
##  jsonlite       1.8.0   2022-02-22 [1] CRAN (R 4.1.3)
##  labeling       0.4.2   2020-10-20 [1] CRAN (R 4.1.1)
##  later          1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lattice        0.20-45 2021-09-22 [1] CRAN (R 4.1.3)
##  lazyeval       0.2.2   2019-03-15 [1] CRAN (R 4.1.2)
##  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
##  magrittr     * 2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
##  MASS           7.3-57  2022-04-22 [1] CRAN (R 4.1.3)
##  Matrix         1.4-1   2022-03-23 [1] CRAN (R 4.1.3)
##  mgcv           1.8-40  2022-03-29 [1] CRAN (R 4.1.3)
##  mime           0.12    2021-09-28 [1] CRAN (R 4.1.1)
##  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.1.1)
##  nlme           3.1-157 2022-03-25 [1] CRAN (R 4.1.3)
##  openssl        2.0.2   2022-05-24 [1] CRAN (R 4.1.3)
##  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
##  plotly         4.10.0  2021-10-09 [1] CRAN (R 4.1.2)
##  plyr           1.8.7   2022-03-24 [1] CRAN (R 4.1.3)
##  promises       1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.0.3)
##  R6             2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
##  RColorBrewer   1.1-3   2022-04-03 [1] CRAN (R 4.1.3)
##  Rcpp           1.0.8.3 2022-03-17 [1] CRAN (R 4.1.3)
##  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.1.2)
##  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
##  scales         1.2.0   2022-04-13 [1] CRAN (R 4.1.3)
##  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
##  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
##  sys            3.4     2020-07-23 [1] CRAN (R 4.1.2)
##  tibble         3.1.7   2022-05-03 [1] CRAN (R 4.1.3)
##  tidyr          1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
##  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
##  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
##  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.1.3)
##  viridisLite    0.4.0   2021-04-13 [1] CRAN (R 4.1.2)
##  withr          2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
##  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
## 
##  [1] C:/Users/spyri/Documents/R/win-library/4.1
##  [2] C:/Program Files/R/R-4.1.2/library
## 
## ------------------------------------------------------------------------------